Research questions / Problem statement:

Infectious diseases are an important public health issue. We therefore examine overall communicable disease rates, and their trends over time, for infectious diseases reported in California. Sexually transmitted diseases will be analyzed separately from other groups of infectious diseases.

Datasets to be used :

1. Infectious Diseases by County, Year and Sex (in California), 2001-2018
   - Source: https://data.chhs.ca.gov/dataset/infectious-disease
   - Raw format of dataset: https://data.chhs.ca.gov/dataset/03e61434-7db8-4a53-a3e2-1d4d36d6848d/resource/75019f89-b349-4d5e-825d-8b5960fc028c/download/idb_odp_2001-2018.csv
   - Name/source: CHHS Open Data
   - Number of columns: 9
   - Number of rows: 154,344
   - Timing: The years included in this dataset are 2001 to 2018

2. STDs in California by disease, county, year and sex
   - Dataset: case counts and rates for sexually transmitted diseases (chlamydia, gonorrhea, and all forms of syphilis) reported for California residents.
   - Source: https://data.chhs.ca.gov/dataset/stds-in-california-by-disease-county-year-and-sex
   - Name/Source: CHHS Open Data
   - Number of columns: 10
   - Number of rows: 9,558
   - Timing: The years included in this dataset are 2001 to 2018

Creating variables for data analysis:

We created new groups of variables to facilitate data presentation and analysis. The new groups of variables are:

1. Name of California region: each county is assigned to one of the 10 California regions.
2. Type of infectious disease: each of the reported diseases is grouped by “type of disease”, following conventional microbiology classification.
3. Year group: years are grouped into 3-year periods.

California regions:

Superior <- "NEVADA", "PLACER", "PLUMAS", "SACRAMENTO", "SHASTA", "SIERRA", "SISKIYOU", "SUTTER", "TEHAMA", "YOLO", "YUBA", "MODOC", "EL DORADO", "BUTTE", "GLENN", "LASSEN"

North Coast <- "DEL NORTE", "HUMBOLDT", "LAKE", "MENDOCINO", "NAPA", "SONOMA", "TRINITY"

Bay area <- "ALAMEDA", "CONTRA COSTA", "MARIN", "SAN FRANCISCO", "SAN MATEO", "SANTA CLARA", "SOLANO"

North San Joaquin Valley <- "ALPINE", "AMADOR", "CALAVERAS", "MADERA", "MARIPOSA", "MERCED", "MONO", "SAN JOAQUIN", "STANISLAUS", "TUOLUMNE"

Central Coast <- "MONTEREY", "SAN BENITO", "SAN LUIS OBISPO", "SANTA BARBARA", "SANTA CRUZ", "VENTURA"

South San Joaquin Valley <- "FRESNO", "INYO", "KERN", "KINGS", "TULARE"

Inland Empire <- "RIVERSIDE", "SAN BERNARDINO"

LA County <- "LOS ANGELES"

Orange County <- "ORANGE"

San Diego and Imperial County <- "IMPERIAL", "SAN DIEGO"

We will also have "California" as a total for the State.
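The region assignment can be sketched as a lookup from county name to region name. This is a minimal base R illustration, not the code used in the report: only two of the ten regions are included, and the `toy` data frame and its `County` column are assumptions made for the example.

```r
# Region definitions copied from the list above (two regions shown for brevity).
regions <- list(
  "Bay area"      = c("ALAMEDA", "CONTRA COSTA", "MARIN", "SAN FRANCISCO",
                      "SAN MATEO", "SANTA CLARA", "SOLANO"),
  "Inland Empire" = c("RIVERSIDE", "SAN BERNARDINO")
)

# Invert the list into a named lookup vector: county name -> region name.
county_to_region <- setNames(
  rep(names(regions), lengths(regions)),
  unlist(regions, use.names = FALSE)
)

# Hypothetical records; `County` is an assumed column name.
toy <- data.frame(County = c("MARIN", "RIVERSIDE", "SOLANO", "KERN"),
                  stringsAsFactors = FALSE)

# Counties missing from the lookup (KERN here) come back as NA,
# which makes an incomplete region list easy to spot.
toy$region <- unname(county_to_region[toy$County])
print(toy)
```

The same lookup pattern would also work for assigning each disease to its type.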

#Groups of infectious diseases:

1. Parasitic <- c("Amebiasis", "Babesiosis", "Cryptosporidiosis", "Cyclosporiasis", "Cysticercosis or Taeniasis", "Malaria", "Giardiasis", "Trichinosis")
2. Toxin_related <- c("Botulism, Foodborne", "Botulism, Other", "Botulism, Wound", "Ciguatera Fish Poisoning", "Domoic Acid Poisoning", "Paralytic Shellfish Poisoning", "Scombroid Fish Poisoning")
3. viral <- c("Chikungunya Virus Infection", "Dengue Virus Infection", "Flavivirus Infection of Undetermined Species", "Hantavirus Infection", "Hepatitis E acute infection", "Rabies, human", "Yellow Fever", "Zika Virus Infection")
4. prions <- c("Creutzfeldt-Jakob Disease and other Transmissible Spongiform Encephalopathies")
5. fungal <- c("Coccidioidomycosis")
6. Bacterial <- c("Anaplasmosis", "Anaplasmosis and Ehrlichiosis", "Anthrax", "Brucellosis", "Campylobacteriosis", "Cholera", "E. coli O157", "E. coli Other STEC (non-O157)", "Legionellosis", "Leprosy (Hansen's Disease)", "Leptospirosis", "Listeriosis", "Lyme Disease", "Plague, human", "Q Fever", "Spotted Fever Rickettsiosis", "Streptococcal Infection (cases in food and dairy workers)", "Ehrlichiosis", "Psittacosis", "Salmonellosis", "Shigellosis", "Tularemia", "Typhoid Fever", "Paratyphoid Fever", "Typhus Fever", "Relapsing Fever", "Shiga toxin-producing E. coli (STEC) without Hemolytic Uremic Syndrome (HUS)", "Vibrio Infection (non-Cholera)", "Shiga Toxin Positive Feces (without culture confirmation)", "Yersiniosis")
7. Infectious_complications <- c("Hemolytic Uremic Syndrome (HUS) without evidence of Shiga toxin-producing E. coli (STEC)", "Hemolytic Uremic Syndrome(HUS)", "Shiga toxin-producing E. coli (STEC) with Hemolytic Uremic Syndrome (HUS)")

Year groups:

“2001-2003”, “2004-2006”, “2007-2009”, “2010-2012”, “2013-2015”, “2016-2018”
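One way to derive these labels from a numeric year is integer division on the offset from 2001. The helper below is a hypothetical sketch (the name `year_group` and its arguments are not from the report), assuming periods start in 2001 and span 3 years.

```r
# Map each year to its 3-year period label, e.g. 2005 -> "2004-2006".
# `start` and `width` are assumptions matching the groups listed above.
year_group <- function(year, start = 2001, width = 3) {
  lower <- start + width * ((year - start) %/% width)
  paste0(lower, "-", lower + width - 1)
}

labels <- year_group(c(2001, 2003, 2004, 2012, 2018))
print(labels)
```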

Tables and codes

# Tables

my_table_data <- ID_tableyears_group_total %>%
  select(ID_type, region, rate, time_period) %>%
  # keep the four etiologies shown in the table, statewide totals only
  filter(ID_type %in% c("Bacterial", "Parasitic", "Fungal", "Viral")) %>%
  filter(region == "California") %>%
  drop_na(rate) %>%
  group_by(ID_type, time_period, region) %>%
  # "drop_last" is the default regrouping; stating it silences the summarise() message
  summarise(cumm_rate = sum(rate), .groups = "drop_last")
kable(my_new_table_data, 
      booktabs=T, 
      col.names=c("Time Period", " ","Bacterial", "Fungal", "Parasitic", "Viral"),  
      align='lccc', 
      caption="Infectious disease rates (Cases/100,000) over time by disease etiology (from 2001 - 2018 by 3 year increments)",
      format.args=list(big.mark=","))
Infectious disease rates (Cases/100,000) over time by disease etiology (from 2001 - 2018 by 3 year increments)

| Time Period | | Bacterial | Fungal | Parasitic | Viral |
|:--|:--|:-:|:-:|:-:|:-:|
| 2001-2003 | California | 109.46 | 14.69 | 30.03 | 0.07 |
| 2004-2006 | California | 101.86 | 23.37 | 26.38 | NA |
| 2007-2009 | California | 102.86 | 21.00 | 24.11 | 0.10 |
| 2010-2012 | California | 108.82 | 36.58 | 20.97 | 0.55 |
| 2013-2015 | California | 124.43 | 22.77 | 21.15 | 1.04 |
| 2016-2018 | California | 142.98 | 52.33 | 27.80 | 1.76 |
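The `kable()` call above uses `my_new_table_data`, which is not defined in the code shown. It was presumably produced by reshaping the long summary into one column per disease type. A minimal sketch of such a step follows, assuming `tidyr::pivot_wider()` was used; the small `long_rates` data frame is a stand-in built from a few values in the table.

```r
library(tidyr)

# Stand-in for the summarised long table: one row per period and disease type.
# Viral is deliberately missing for 2004-2006.
long_rates <- data.frame(
  time_period = c("2001-2003", "2001-2003", "2001-2003", "2001-2003",
                  "2004-2006", "2004-2006", "2004-2006"),
  region      = "California",
  ID_type     = c("Bacterial", "Fungal", "Parasitic", "Viral",
                  "Bacterial", "Fungal", "Parasitic"),
  cumm_rate   = c(109.46, 14.69, 30.03, 0.07, 101.86, 23.37, 26.38)
)

# One row per period and region, one column per disease type.
my_new_table_data <- pivot_wider(long_rates,
                                 names_from  = ID_type,
                                 values_from = cumm_rate)
print(my_new_table_data)
```

A period with no row for a given disease type becomes `NA` after reshaping, which is one possible explanation for the `NA` entries in the tables.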
# Table for  bay area only :
kable(my_new_bayarea_table, 
      booktabs=T, 
      col.names=c("Time_Period", " ", "Bacterial", "Fungal", "Parasitic", "Viral"),
      align='lccc', 
      caption="Infectious disease rates over time in the Bay Area from 2001-2018 by etiology of disease and time period (3-year cumulative)",
      format.args=list(big.mark=","))
Infectious disease rates over time in the Bay Area from 2001-2018 by etiology of disease and time period (3-year cumulative)

| Time_Period | | Bacterial | Fungal | Parasitic | Viral |
|:--|:--|:-:|:-:|:-:|:-:|
| 2001-2003 | Bay_area | 945.44 | NA | 342.40 | NA |
| 2004-2006 | Bay_area | 947.73 | 10.53 | 306.70 | NA |
| 2007-2009 | Bay_area | 932.71 | 9.76 | 254.58 | NA |
| 2010-2012 | Bay_area | 992.36 | 14.77 | 250.65 | NA |
| 2013-2015 | Bay_area | 1,136.90 | 20.89 | 228.59 | NA |
| 2016-2018 | Bay_area | 1,300.47 | 42.55 | 343.04 | 3.47 |

#Figures and codes:

Figure 1 shows that, among the reported infectious diseases (excluding sexually transmitted diseases), bacterial diseases are the most commonly reported, followed by fungal and then parasitic diseases. Viral diseases have the lowest rate. These numbers do not necessarily translate into real prevalence, since many diseases are not considered “reportable” because they are common and ubiquitously distributed. In general, the frequency of reported bacterial, fungal and viral diseases has increased over the years, while the frequency of parasitic diseases decreased until 2016-2018, when it began to rise again.


Figure 2: This figure shows that reports of bacterial diseases have increased over time since 2001. This increase could reflect a real increase in reportable cases, improved reporting methodology, or both. The same applies to fungal infections. Parasitic infections decreased, except for the period 2016-2018, which shows an increase. Viral infection reports have increased since 2016 due to newly reportable viral conditions such as Zika and Chikungunya.

Table 3: Rate per 100,000 of reportable infectious diseases per year in California during 2001-2018, by disease type

Type of infectious disease reported (rate per 100,000)

| Year | Bacterial | Fungal | Parasitic | Viral |
|:--|:-:|:-:|:-:|:-:|
| 2001 | 37.21 | 4.32 | 11.73 | 0.08 |
| 2002 | 38.47 | 4.60 | 9.69 | 0.01 |
| 2003 | 34.13 | 5.77 | 8.64 | 0.03 |
| 2004 | 33.75 | 7.10 | 8.67 | 0.01 |
| 2005 | 34.35 | 7.89 | 8.88 | 0.03 |
| 2006 | 34.21 | 8.38 | 8.90 | 0.04 |
| 2007 | 32.91 | 8.05 | 8.75 | 0.03 |
| 2008 | 35.54 | 6.48 | 7.77 | 0.02 |
| 2009 | 34.73 | 6.47 | 7.61 | 0.11 |
| 2010 | 36.99 | 11.88 | 7.27 | 0.22 |
| 2011 | 33.39 | 13.87 | 6.91 | 0.11 |
| 2012 | 38.76 | 10.83 | 6.84 | 0.26 |
| 2013 | 38.32 | 8.65 | 6.95 | 0.34 |
| 2014 | 41.53 | 5.99 | 6.79 | 0.34 |
| 2015 | 44.85 | 8.13 | 7.50 | 0.37 |
| 2016 | 43.02 | 14.13 | 8.73 | 1.69 |
| 2017 | 48.56 | 19.33 | 9.45 | 0.77 |
| 2018 | 51.75 | 18.87 | 9.70 | 0.51 |

Data source: https://data.chhs.ca.gov/dataset/infectious-disease

Table 3: This table shows the reported cases per 100,000 by infectious disease type during 2001-2018 (the same values plotted in Figure 2).
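To put the trends in Table 3 into numbers, the relative change between the first and last year can be computed directly from the tabulated rates. The sketch below is an illustration added here, not part of the original analysis; it uses only the 2001 and 2018 rows of Table 3.

```r
# 2001 and 2018 rates per 100,000, copied from Table 3.
rates <- data.frame(
  type      = c("bacterial", "fungal", "parasitic", "viral"),
  rate_2001 = c(37.21, 4.32, 11.73, 0.08),
  rate_2018 = c(51.75, 18.87, 9.70, 0.51)
)

# Percent change from 2001 to 2018, rounded to one decimal place.
rates$pct_change <- round(
  100 * (rates$rate_2018 - rates$rate_2001) / rates$rate_2001, 1
)
print(rates)
```

Because the viral rate starts from a very small baseline, its percent change is large and should be read with caution. A comparison of two single years also ignores the year-to-year fluctuation visible in the table.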

Figure 3: Among the bacterial infections, the most commonly reported is Campylobacteriosis, followed by Salmonellosis and Shigellosis.

Figure 4: The most common parasitic disease is Giardiasis, followed by Amebiasis and Cryptosporidiosis.

Figure 5: Among viral infections, the most commonly reported was Dengue virus infection. The newly described Chikungunya and Zika viruses were not reported in California until 2017.

##Analyzing STDs:

#Methods: The visualization that I would like to create is a graph of the rates of each disease type (bacterial, viral, STD, etc.) over time within the Bay Area. To do this, I first filtered both the grouped dataset and the STD dataset to only the Bay Area counties (“Alameda”, “Santa Clara”, “San Mateo”, “San Francisco”, “Marin”, “Contra Costa”, “Solano”).
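The filtering step could look like the following base R sketch. The `toy` data frame and its `County` and `Cases` columns are assumptions for illustration. County names are compared in upper case because the region list earlier in this report spells them in capitals, while the names here are in title case.

```r
# Bay Area counties as listed in the Methods paragraph.
bay_area <- c("Alameda", "Santa Clara", "San Mateo", "San Francisco",
              "Marin", "Contra Costa", "Solano")

# Hypothetical records with inconsistent capitalisation.
toy <- data.frame(
  County = c("ALAMEDA", "Kern", "marin", "Solano"),
  Cases  = c(10, 4, 2, 7),
  stringsAsFactors = FALSE
)

# Case-insensitive membership test, then keep only the matching rows.
in_bay   <- toupper(toy$County) %in% toupper(bay_area)
bay_only <- toy[in_bay, ]
print(bay_only)
```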

ggplot(hope, aes(x = Year, y = Overall_Rate)) +
  facet_wrap(vars(Disease_Type), ncol = 2, scales = "free_y") +
  geom_line(aes(color = Disease_Type)) +
  labs(x = "Year", y = "Overall Rate (per 100K)",
       title = "Figure 7: Overall Rates of Infectious diseases (including STDs) in Bay Area Counties from 2001-2018") +
  theme_minimal()

Interpretation of graph: This graph shows the overall rates per year of different types of infectious disease from 2001 to 2018 in the Bay Area counties, with a separate y-axis scale for each disease type (scales = "free_y"). I created this graph to better visualize the trends. The panels show that fungal and bacterial rates are increasing over time, along with STD rates.

ggplot(std_set, aes(x = Year, y = Overall_Rate)) +
  geom_line(aes(color = Sex)) +
  labs(x = "Year", y = "Overall Rate (per 100K)",
       title = "Figure 8: Overall rates per year of STDs (Bacterial) in Bay Area Counties from 2001-2018 separated by Sex") +
  theme_minimal()

Interpretation of graph: This graph shows the overall rates per year of bacterial STDs in the Bay Area counties from 2001 to 2018, separated by sex. I created this graph to better visualize the trends in STDs between males and females. The graph shows a marked increase in the overall rate of STDs for both males and females. Before about 2014, female rates appear to have been higher than male rates. From about 2014 onward, however, male rates increase even more steeply.

# total cases per disease and year, using the rows where Sex is "Total"
tabledataSTD <- individualdatafinal %>%
  filter(Sex == "Total") %>%
  group_by(Disease, Year) %>%
  summarise(STD_case_total = sum(Cases), .groups = "drop_last")

# join the yearly totals (testin supplies totalp) and compute the rate per 100,000
newSTD <- left_join(testin, tabledataSTD, by = "Year")
newSTD$Overall_Rate <- (newSTD$STD_case_total / newSTD$totalp) * 100000

# one row per disease, one column per year; keep only 2014-2018
newSTD1 <- newSTD %>%
  select(1, 3, 5) %>%
  pivot_wider(names_from = "Year", values_from = "Overall_Rate") %>%
  select(1, 15, 16, 17, 18, 19)
formattable(newSTD1)
| Disease | 2014 | 2015 | 2016 | 2017 | 2018 |
|:--|--:|--:|--:|--:|--:|
| Chlamydia | 410.80605 | 468.21643 | 486.88633 | 537.18960 | 571.75121 |
| Early Syphilis | 26.19914 | 29.66732 | 31.23926 | 38.29107 | 41.31231 |
| Gonorrhea | 132.25710 | 168.62095 | 189.39598 | 217.44363 | 222.31496 |

Interpretation of table: This table shows the rates of STDs in the Bay Area over the last 5 years included in the dataset (2014-2018). As shown in the table, the rates of all three STDs increased in every year shown.
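The rates in this table come from the cases / population × 100,000 calculation in the code above. As a small worked illustration of that formula, the sketch below uses a hypothetical helper, `rate_per_100k`, and made-up case and population counts that are not taken from the datasets.

```r
# Rate per 100,000 population: cases / population * 100,000.
rate_per_100k <- function(cases, population) {
  stopifnot(all(population > 0))  # guard against division by zero
  cases / population * 1e5
}

# Made-up example: 250 cases in a population of 500,000.
example_rate <- rate_per_100k(cases = 250, population = 500000)
print(example_rate)

# The function is vectorised, so whole columns can be passed at once.
example_rates <- rate_per_100k(cases = c(250, 30), population = c(500000, 120000))
print(example_rates)
```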